Airbnb Data Analytics

In our previous post, we have wrote on how to connect to MySQL Server and insert you data into the database. In this article, we will explore the data using SQL query in python.

The following is the list of libraries that we will use throughout the notebook.

Connect to MySQL Server

Let's first connect to the MySQL Server that we have set up previously.

Let's check the available table in our database.

We currently have 2 different tables with the following properties:

Research Question

It's better for us to define what kind of insight that we want to draw from the database. Here, I define my research questions. These questions just to give you the illustration on how to do data analysis, so they may not deep enough to give any recommendation. You can come up with your questions as well.

Room Listing

Host

Room Listing

Common Room Type

Let's start with our first question, what is the most common room type and property type available? We can acquire this information from the listing table.

First, let's check how many room listing we have.

Now let's check the number of data for each Room type.

We can use the COUNT(*) to count the number of rows and use GROUP BY to group the data based on the room type. To get the top 15, we need to arrange the data by using ORDER BY and set the sorting to descending (from largest to smallest) using the DESC.

Let's create a simple bar chart to visualize this result.

The most common listing available is either entire home or apartment, followed by private room. Some hotels are also listing their room in Airbnb apparently.

Common Property Type

Let's get the top 15 property type from the data. Property type is the more granular and more detailed version from the room type.

We can use the COUNT(*) to count the number of rows and use GROUP BY to group the data based on the property type. To get the top 15, we need to arrange the data by using ORDER BY and set the sorting to descending (from largest to smallest) using the DESC. Finally, we limit the data to take only the first 15 row using the LIMIT.

Let's create a simple bar chart to visualize this result.

Property Type by Average Price

Does apartment and condominium become popular choice because they are the cheapest option out there for the host to offer? To answer this question, let's check the top 10 and bottom 10 property type based on the average price for a property type with at least 50 different listing.

The most expensive property that is being listed is the entire villa, followed by entire house. Surprisingly, shared room in house has higher average price compared to other property. Meanwhile, on the bottom 10 the list is dominated by listing for private and shared room. However, we don't see listing for private room or entire apartment/condominium in the top and bottom 10, so we may conclude that the price for those listings is somewhere in the middle.

Property Type by Review Score

According to Airbnb, there are several star ratings available for the user to measure how good the service given by the listing. Several rating that we can use including:

First, let's check if there are significant difference in average of rating of the overall experience for certain property type. Some listing may don't have any rating yet, so we will filter data that has non-NULL rating and select only property type with at least 50 different listing.

Entire loft and Entire townhouse become the top listings with the highest overall experience, reflected by the average review scores rating. Entire apartment and Entire condominium are also present as well as the top 10 overall experience. The listings with the least overall experience belong to Room in aparthotel and Room in hotel. Despite being one of the most expensive listing, people generally have lower overall experience rating for Room in hotel compared to other listing.

Correlation Between Price and Overall Experience

Let's check whether listing with higher price will give higher overall experience as well using simple scatterplot. We will scale the listing price to log10 units. We only collect listing with non-NULL value for the review scores rating. We will also filter the data by collecting listing that has more than 10 number of reviews.

Most of the listing room has price higher than 300 bhat (2.5 in log10 scale), but there is one room that has price lower than 300 bhat and has high review scores rating. Based on the distribution, no listing room has overall experience lower than 80 for listing room with price higher than 10,000 bhat (4 in log10 scale). For price ranging from 300 bhat to 10,000 bhat, there is no correlation between price and overall experience. However, as the price goes up, people tend to be give higher rating.

Amenities

Let's continue by checking the most common amenities that is included for the given service.

The amenities is still in a long string format. To process this data, we will split the string into individual amenities indicated by the comma.

To prevent the possibility of mistyping or other free text error, we will convert all alphabets into lowercase.

Let's start collecting all amenities.

After all amenities are collected, it's time to create a dataframe that consists of the amenities and it's respective frequency. We will normalize the frequency into percentage to reflect of all listing available, what percentage of listing use the mentioned amenities.

Almost all of the available listing has Air conditioning and Long term stays allowed, followed by Essentials, Wifi, and TV. More than 60% of all listing also has Dedicated workspace and Kitchen, which may require more spaces outside the bedroom. According to Airbnb, essentials amenities including:

Essential amenities are the basic items that a guest expects in order to have a comfortable stay. So we can check the distribution of the overall experience from listing with and without Essentials amenities.

We will visualize the data using boxplot for each category.

Listing room that has Essentials included in the amenities tend to have slightly higher review scores rating based on the median (the middle line from the box) compared to listing that does not include Essentials.

Listing Map

Let's now draw map and see the distribution of the room listing based on the geopgrahic location using the folium package. We may also want to give the information about the name of the host for the popup of the map. Therefore, we need to join our listing table with the host_info table and get name host name.

Some listing may not have a host name or even the name for the listing, therefore we will modify the value. We will fill missing value from the host name and name column with the string No Host Name and No Name.

Now we will draw the map. We will also create a custom popup where we can put several information for each listing. Since visualizing a large number of marker/listing to the map will take a long time, in this part I restrict to only show the first 5,000 listings.

Host

We will continue answering the research question by looking at the host data.

Top Host by Earning

We will see who is the top host based on the total earning cumulated from his/her listings. We will use the following formula to calculate the total earning:

$$ Total\ earning = \Sigma_{i=1}^n price_i\times number\ of\ reviews_i \times minimum\ nights_i $$

Notes:

Unfortunately, we don't have the detailed data on the number of stay for each customer, so we will use the minimum nights as the number of stay so what we actually calculate is the minimum total earning. We use the number of reviews as the proxy of number of customers.

Now we will visualize the top host based on the number of listing for the start, see if some hosts has higher number of listing than others.

Only a handful of hosts has more than 5 listings. Now we will continue looking at the top 15 based on the Total Earning generated.

So, the list of host name is different from the top host by the number of listing and the top host by total earning. This indicates that a higher number of listings doesn't guarantee to give more earnings.

Does higher average price of listing from a single host correlate with higher total earning? Let's answer this questions using scatterplot.

As we can see, there is no visible pattern between average price and total earning. However, some of the most highest total earnings are generated by host with relatively low average price from his/her listings. Therefore, higher average price from a host doesn't guarantee to give him/her a higher total earning

Superhost

According to Airbnb, superhosts are experienced hosts who provide a shining example for other hosts, and extraordinary experiences for their guests.

In here we are interested to check if there is any difference in the distribution of review scores for listing owned by a superhost and normal host.

Both superhost and normal host has the same peak for the distribution, indicating that on average there is no significant difference on review scores rating or the overall experience. However, superhosts has thinner distribution, indicating that superhost tend to have higher review score due to low variability.

Now we will look at the response rate and the acceptance rate between normal host and the superhost. The following is the detail description about response rate and acceptance rate according to Airbnb:

We need to transform the response rate and acceptance rate to be a proper numeric values by removing the percentage character.

Let's see the distribution of the response rate and acceptance rate.

Based on plots above, the distribution of response rate and acceptance rate is the same, indicating that there is no signficant difference between a normal host and a superhost.

Host Verification

When you host on Airbnb, you may be asked to provide information such as your legal name, date of birth, or government ID so it can then be verified. This is a mandatory step for a host to prevent fraud and other financial crimes. Let's deep dive into what kind of informations are mostly filled and verified from a host.

Let's check if there is any host that is not verified yet.

There are a lot of host that is yet to be verified. Let's also check if there is any host that has missing value or no host_verification.

Let's check the most commonly verified identity.

Phone number and email are the most common way to verify a host and has been done by more than 80% of all hosts, followed by the governemnt id of the host.

Number of Host Joined

Let’s check the number of joined monthly over time, see if there is interesting insights from this data. We will count the number of host joined by monthly period since counting daily frequencies may be too small window of time.

Let’s visualize this data using heatmap. We will fill the number of host joined with 0 for any missing values.

Based on the heatmap, we can see that a large number of hosts started to join Airbnb from the year 2015, with the highest record of number of host joined in a single is on August 2015, September 2018, and October 2018. The number of new hosts are currently going down since April 2020, perhaps due to the global COVID-19 pandemic.

Conclusion

We have done some data analysis to better understand the information regarding the room listing and host from Airbnb in Bangkok. We have saw what is the most common room type available, is there any correlation between room price and the review score, who are the top 10 host, etc. You can continue further by building a beautiful analytics dashboard in Tableau or using any libraries from python, such as plotly dash, flask, or streamlit.

Don't forget to close your database connection if you are finished.